Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow non-parenthesized lambda expr in f-string #7263

Merged
merged 21 commits into from
Sep 14, 2023

Conversation

dhruvmanila
Copy link
Member

@dhruvmanila dhruvmanila commented Sep 11, 2023

Summary

This PR updates the handling of disallowing non-parenthesized lambda expr in
f-strings.

Previously, the lexer was used to emit an empty FStringMiddle token in certain
cases for which there's no pattern in the parser to match. That would then raise
an unexpected token error while parsing.

This PR adds a new f-string error type LambdaWithoutParentheses. In cases
where the parser still can't detect the error, it's guaranteed to be caught by
the fact that there's no FStringMiddle token in the pattern.

Test Plan

Add test cases wherever we throw the LambdaWithoutParentheses error.

Benchmarks

As this is the final PR for the parser, I'm putting the parser benchmarks here:

group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec

@@ -1351,7 +1351,13 @@ NamedExpression: ast::ParenthesizedExpr = {
};

LambdaDef: ast::ParenthesizedExpr = {
<location:@L> "lambda" <location_args:@L> <parameters:ParameterList<UntypedParameter, StarUntypedParameter, StarUntypedParameter>?> <end_location_args:@R> ":" <body:Test<"all">> <end_location:@R> =>? {
<location:@L> "lambda" <location_args:@L> <parameters:ParameterList<UntypedParameter, StarUntypedParameter, StarUntypedParameter>?> <end_location_args:@R> ":" <fstring_middle:fstring_middle?> <body:Test<"all">> <end_location:@R> =>? {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to add the fstring_middle optional token after the body but that creates conflict when generating. This makes sense and I guess the solution would be to modify Test<"all"> as you suggested in Discord but I don't think it's worth the complexity right now.

@dhruvmanila dhruvmanila added the parser Related to the parser label Sep 11, 2023
crates/ruff_python_ast/src/nodes.rs Show resolved Hide resolved
crates/ruff_python_parser/src/python.lalrpop Show resolved Hide resolved
Comment on lines -305 to -312
// This is to account for the empty `FStringMiddle` token that is created
// to check for non-parenthesized lambda expressions. `FStringMiddle` token
// is created for anything which is not part of the f-string expression nor
// an opening or closing brace. With that in mind take following example:
//
// ```python
// f"{lambda x:{x}}"
// ```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe move part of this description to the lambda parse rule?

Comment on lines +501 to +503
// TODO(dhruvmanila): The parser can't catch all cases of this error, but
// wherever it can, we'll display the correct error message.
/// A lambda expression without parentheses was encountered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This comment is confusing IMO because I understand that our Parser + Lexer accepts invalid input. I don't think that's the case. What you're trying to say is that the parser only accepts some of it. Maybe move the other comment from the Lexer here and explain which errors are captured by the lexer and which errors are captured by the parser.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that's the case. What I mean here is that in all cases the parser won't accept a lambda expression without parentheses inside a f-string. It'll always throw an error in such scenarios. The problem I'm highlighting here is that the parser won't give an appropriate error in all scenarios but only in a few of them while in the others it'll just give "an unexpected token" error. I'll update the comment though.

crates/ruff_python_parser/src/string.rs Show resolved Hide resolved
dhruvmanila added a commit that referenced this pull request Sep 14, 2023
## Summary

This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

### `string.rs`

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

### `Constant::kind` changed in the AST

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details> 

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details> 

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details> 

### Errors

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
Base automatically changed from dhruv/fstring-parser-2 to dhruv/pep-701 September 14, 2023 02:07
@dhruvmanila dhruvmanila merged commit 648fe07 into dhruv/pep-701 Sep 14, 2023
@dhruvmanila dhruvmanila deleted the dhruv/fstring-parser-3 branch September 14, 2023 02:15
@codspeed-hq
Copy link

codspeed-hq bot commented Sep 14, 2023

CodSpeed Performance Report

Merging #7263 will degrade performances by 10.1%

⚠️ No base runs were found

Falling back to comparing dhruv/fstring-parser-3 (85e5c01) with main (04183b0)

Summary

❌ 8 regressions
✅ 17 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main dhruv/fstring-parser-3 Change
lexer[pydantic/types.py] 4.1 ms 4.6 ms -10.1%
lexer[numpy/ctypeslib.py] 2 ms 2.1 ms -8.07%
parser[large/dataset.py] 68.4 ms 70.2 ms -2.59%
lexer[unicode/pypinyin.py] 620.2 µs 676.3 µs -8.29%
parser[numpy/ctypeslib.py] 12.4 ms 12.7 ms -2.62%
lexer[large/dataset.py] 9.8 ms 10.8 ms -8.8%
lexer[numpy/globals.py] 233.2 µs 253.9 µs -8.17%
parser[unicode/pypinyin.py] 4.3 ms 4.4 ms -2.8%

@dhruvmanila dhruvmanila linked an issue Sep 15, 2023 that may be closed by this pull request
dhruvmanila added a commit that referenced this pull request Sep 18, 2023
## Summary

This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

### `string.rs`

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

### `Constant::kind` changed in the AST

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details> 

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details> 

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details> 

### Errors

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 18, 2023
## Summary

This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

## Test Plan

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

## Benchmarks

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 19, 2023
## Summary

This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

### `string.rs`

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

### `Constant::kind` changed in the AST

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details> 

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details> 

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details> 

### Errors

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 19, 2023
## Summary

This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

## Test Plan

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

## Benchmarks

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 20, 2023
## Summary

This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

### `string.rs`

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

### `Constant::kind` changed in the AST

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details> 

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details> 

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details> 

### Errors

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 20, 2023
## Summary

This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

## Test Plan

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

## Benchmarks

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 22, 2023
This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 22, 2023
This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 22, 2023
This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 26, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 26, 2023
This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 27, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 27, 2023
This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 28, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 28, 2023
This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 29, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 29, 2023
This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
dhruvmanila added a commit that referenced this pull request Sep 29, 2023
This PR adds support for PEP 701 in the parser to use the new tokens
emitted by the lexer to construct the f-string node.

Without an official grammar, the f-strings were parsed manually. Now
that we've the specification, that is being used in the LALRPOP to parse
the f-strings.

This file includes the logic for parsing string literals and joining the
implicit string concatenation. Now that we don't require parsing
f-strings manually a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly
concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which is
basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings` but now it takes
the parsed nodes instead. So, we just need to concatenate them into a
single node.

> A short primer on `FStringMiddle` token: This includes the portion of
text inside the f-string that's not part of the expression and isn't an
opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the
`foo `, `.3f` and ` bar` are `FStringMiddle` token content.

***Discussion in the official implementation:
python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and
f-strings are used in an implicitly concatenated string value. For
example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was
on the first string. So, taking the above example, both `"foo"` and
`"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be
assigned the value:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</p>
</details>

With the hand written parser, we were able to provide better error
messages in case of any errors such as the following but now they all
are removed and in those cases an "unexpected token" error will be
thrown by lalrpop:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" was removed and instead
we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was
removed because f-strings now support those characters which are mainly
same quotes as the outer ones, escape sequences, comments, etc.

1. Refactor existing test cases to use `parse_suite` instead of
`parse_fstrings` (doesn't exists anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite`
means that the snapshot would produce the module node instead of just a
list of f-string parts. I've manually verified that the parts are still
the same along with the node ranges.

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit that referenced this pull request Sep 29, 2023
This PR updates the handling of disallowing non-parenthesized lambda
expr in f-strings.

Previously, the lexer was used to emit an empty `FStringMiddle` token in
certain cases for which there's no pattern in the parser to match. That would
then raise an unexpected token error while parsing.

This PR adds a new f-string error type `LambdaWithoutParentheses`. In
cases where the parser still can't detect the error, it's guaranteed to be
caught by the fact that there's no `FStringMiddle` token in the pattern.

Add test cases wherever we throw the `LambdaWithoutParentheses` error.

As this is the final PR for the parser, I'm putting the parser benchmarks here:

```
group                         fstring-parser                         main
-----                         --------------                         ----
parser/large/dataset.py       1.00      4.7±0.24ms     8.7 MB/sec    1.03      4.8±0.25ms     8.4 MB/sec
parser/numpy/ctypeslib.py     1.03   921.8±39.00µs    18.1 MB/sec    1.00   897.6±39.03µs    18.6 MB/sec
parser/numpy/globals.py       1.01     90.4±5.23µs    32.6 MB/sec    1.00     89.6±6.24µs    32.9 MB/sec
parser/pydantic/types.py      1.00  1899.5±94.78µs    13.4 MB/sec    1.03  1954.4±105.88µs    13.0 MB/sec
parser/unicode/pypinyin.py    1.03   292.3±21.14µs    14.4 MB/sec    1.00   283.2±13.16µs    14.8 MB/sec
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Related to the parser python312 Related to Python 3.12
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for PEP 701 in the parser
2 participants